arXiv:1502.00750v1 [cs.CV] 3 Feb 2015


RECOGNIZING FOCAL LIVER LESIONS IN CONTRAST-ENHANCED ULTRASOUND 
WITH DISCRIMINATIVELY TRAINED SPATIO-TEMPORAL MODEL 


Xiaodan Liang* Qingxing Cao* Rui Huang† Liang Lin*

* Sun Yat-sen University † NEC Laboratories, China


ABSTRACT 

The aim of this study is to provide an automatic computational
framework to assist clinicians in diagnosing Focal Liver Lesions
(FLLs) in Contrast-Enhanced Ultrasound (CEUS). We represent FLLs
in a CEUS video clip as an ensemble of Regions of Interest (ROIs),
whose locations are modeled as latent variables in a
discriminative model. Different types of FLLs are characterized
by both spatial and temporal enhancement patterns of the ROIs.
The model is learned by iteratively inferring the optimal ROI
locations and optimizing the model parameters. To efficiently
search for the optimal spatial and temporal locations of the
ROIs, we propose a data-driven inference algorithm that combines
effective spatial and temporal pruning. The experiments show that
our method achieves promising results on the largest dataset in
the literature (to the best of our knowledge), which we have made
publicly available.

Index Terms — CEUS, FLLs, Spatio-Temporal Model, 

1. INTRODUCTION 

Liver cancer is the third leading cause of cancer-related death
[1]. Visualization of Focal Liver Lesions (FLLs) has been
attempted with various imaging techniques. Ultrasound is often
used in diagnosis because of its low cost, efficiency and
non-invasiveness. Contrast-Enhanced Ultrasound (CEUS) can further
assess the contrast enhancement patterns of FLLs (i.e., the
intensity of the FLL area relative to that of the adjacent
parenchyma), which has markedly improved the diagnostic accuracy
for FLLs. As shown in Fig. 1, temporal enhancement patterns
typically distinguish benign from malignant FLLs (e.g., sustained
enhancement in the last two vascular phases for benign FLLs and
hypo-enhancement for malignant FLLs). Spatial enhancement
patterns during the arterial phase, on the other hand, often
characterize the specific type of FLL.

Extensive research efforts have been made to assist experts in
diagnosing different types of cancers and, in particular, FLLs
using ultrasound images. However, the application of CEUS for
differentiating FLLs is still a relatively new technique. A
cascade of Artificial Neural Networks has been employed to
classify FLLs based on manually segmented lesion regions. Anaye
et al. [5] analyze the Dynamic Vascular Patterns (DVPs) of FLLs
with respect to the surrounding healthy parenchyma to
differentiate between benign and malignant FLLs. Bakas et al.
track a manually initialized FLL and its surrounding parenchyma
to characterize it as either benign or malignant based on its
vascular signature. In their more recent work, an automated
method for selecting the optimal frame for initializing the FLL
candidates is proposed.

Fig. 1. The enhancement pattern ROIs of three different FLLs:
Hemangioma (HEM), Focal Nodular Hyperplasia (FNH) and
Hepatocellular Carcinoma (HCC), in three different phases: the
arterial, portal venous and late phases. HEM and FNH are benign
FLLs and HCC is a malignant FLL.

In all these works, varying degrees of manual interaction are
required to identify the Regions of Interest (ROIs) of FLLs or
the normal parenchyma. Manual annotations depend heavily on the
skill and knowledge of the experts, leading to large inter- and
intra-observer variations in image interpretation. Moreover, the
ever-increasing amount of CEUS data acquired and processed
nowadays demands automatic computational systems that can save
radiologists' time and effort. In addition, most previous works
focused on differentiating between benign and malignant FLLs, or
on characterizing one specific type of FLL. We, in contrast,
combine different enhancement patterns to recognize multiple
types of FLLs in a unified framework.

The main contributions of our work are threefold. First, we
propose a fully automatic computational framework to recognize
FLLs by modeling the locations of ROIs as latent variables in a
discriminative model and combining both spatial and temporal
enhancement patterns of the ROIs in the framework. Our model is
trained by a weakly supervised learning algorithm, which
alternates between inferring the most probable spatial and
temporal locations of the ROIs and optimizing the model
parameters. Second, considering that most of the video frames and
most regions in each frame contain redundant or irrelevant
information for recognizing FLLs, the automatic detection of the
optimal ROI locations is made efficient by a novel data-driven
inference method, which combines spatial and temporal pruning
techniques to disregard less discriminative frames and regions.
The optimal ROI locations are then determined by dynamic
programming. Third, a new region representation for ROIs is
presented to capture the important and relevant ultrasonic
characteristics of FLLs; it is not limited to our framework.

We apply our method to a new dataset (the SYSU-CEUS dataset)
that we collected and made public. It contains in total 353 CEUS
video sequences of three types of FLLs (186 HCC, 109 HEM and 58
FNH), and is, to the best of our knowledge, the largest dataset
in the literature. The experimental results demonstrate that our
method achieves promising performance without manual interaction.

2. OUR MODEL 

2.1. Region representation 

The accurate classification of FLLs depends heavily on how the
characteristics of the lesion regions are represented (e.g.,
internal echo, morphology, edge, echogenicity and posterior echo
enhancement). However, a single ROI $R$ is often insufficient to
capture all the ultrasonic characteristics. For instance, the
region inside the lesion, denoted as $R^-$, can capture the
internal echo of the FLL; the lesion region $R$ can be used to
observe the boundary and the morphology of the FLL; and the
tissue area surrounding the lesion, denoted as $R^+$, can be used
to measure the posterior echo enhancement. The echogenicity of
the lesion can be measured by comparing the intensities of the
above regions. Given an ROI $R$, the regions $R^-$ and $R^+$ are
obtained by shrinking and enlarging $R$ by a small factor,
respectively. We then propose an effective region representation
as follows:

$$f(R) = [f^t(R^-),\, f^t(R),\, f^t(R^+),\, f^d(R^-, R),\, f^d(R, R^+)] \tag{1}$$

where $f^t$ extracts the appearance features of each region, such
as the Grey Level Co-occurrence Matrix (GLCM) and Local Phase
(LP), and $f^d$ calculates the mean intensity difference of two
regions. Consequently, the concatenation of all these features,
$f(R)$, captures all the desired ultrasonic characteristics of
the region $R$.
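
As a rough illustration of Eq. (1), the sketch below builds the
representation from an axis-aligned box. It is only a sketch under
stated assumptions: the box format, the scaling factor of 0.2 and
the `appearance` placeholder (mean and standard deviation instead
of GLCM or Local Phase features) are illustrative choices that the
text does not specify, and $R^+$ is taken to be the whole enlarged
box rather than a ring around the lesion.

```python
import numpy as np

def scale_box(box, factor, shape):
    """Scale a (row, col, height, width) box about its centre, clipped to the image."""
    r, c, h, w = box
    cy, cx = r + h / 2.0, c + w / 2.0
    nh, nw = h * factor, w * factor
    r0 = int(max(0, round(cy - nh / 2.0)))
    c0 = int(max(0, round(cx - nw / 2.0)))
    r1 = int(min(shape[0], round(cy + nh / 2.0)))
    c1 = int(min(shape[1], round(cx + nw / 2.0)))
    return r0, c0, r1 - r0, c1 - c0

def crop(image, box):
    r, c, h, w = box
    return image[r:r + h, c:c + w]

def appearance(patch):
    """Hypothetical stand-in for f^t: mean and standard deviation of the patch."""
    return np.array([patch.mean(), patch.std()])

def region_representation(image, box, factor=0.2):
    """Concatenate f^t(R-), f^t(R), f^t(R+), f^d(R-, R), f^d(R, R+)."""
    inner = crop(image, scale_box(box, 1.0 - factor, image.shape))  # R-
    roi = crop(image, box)                                          # R
    outer = crop(image, scale_box(box, 1.0 + factor, image.shape))  # R+
    f_d_inner = inner.mean() - roi.mean()   # f^d(R-, R)
    f_d_outer = roi.mean() - outer.mean()   # f^d(R, R+)
    return np.concatenate([appearance(inner), appearance(roi),
                           appearance(outer), [f_d_inner, f_d_outer]])
```

With the two-value placeholder the result has 8 entries; swapping
in a richer `appearance` function changes only the length of the
three appearance blocks.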

2.2. Model representation 

Given a CEUS video sequence $\mathbf{x}$, let $y$ be the
corresponding class label of the FLL in this video, ranging over
a finite set $\mathcal{Y}$ (e.g., $\mathcal{Y} = \{\text{HCC},
\text{HEM}, \text{FNH}\}$). We assume that the FLL can be
compactly represented by a set of ROIs $\{R_1, R_2, \ldots,
R_m\}$ in three vascular phases: the arterial, portal venous and
late phases. Intuitively, ROIs are the most discriminative
regions for distinguishing different FLLs. Each ROI $R_i$ is a
region extracted from the video frame $t_i$ at the spatial
location $p_i = (x_i, y_i, s_i)$, where $x_i, y_i$ are the
coordinates and $s_i$ is the scale of the ROI. The latent
variables are $\mathbf{h} = \{h_1, h_2, \ldots, h_m\}$, where
$h_i = (p_i, t_i)$ is the location of $R_i$, taking values from a
finite set $\mathcal{H}_i$ of all possible ROI locations. Given a
video $\mathbf{x}$, its class label $y$ and latent variables
$\mathbf{h}$, the conditional probability of the recognition
problem is defined as

$$p(y \mid \mathbf{x}; \omega) = \sum_{\mathbf{h} \in \mathcal{H}} p(y, \mathbf{h} \mid \mathbf{x}; \omega) \tag{2}$$

where $\omega$ is the model parameter vector, $\mathcal{H} =
\mathcal{H}_1 \times \mathcal{H}_2 \times \cdots \times
\mathcal{H}_m$, and $\Phi(\mathbf{x}, \mathbf{h}, y)$ is a
feature vector depending on the video sequence $\mathbf{x}$, the
class label $y$ and the latent variables $\mathbf{h}$. We define
$\omega^T \cdot \Phi(\mathbf{x}, \mathbf{h}, y)$ as the sum of
two terms, a unary potential and a pairwise potential:

$$\omega^T \cdot \Phi(\mathbf{x}, \mathbf{h}, y) = \sum_{i=1}^{m} \alpha_i \cdot \phi^u(\mathbf{x}, y, h_i) + \sum_{(i,j) \in \mathcal{E}} \beta_{i,j} \cdot \phi^p(\mathbf{x}, y, h_i, h_j) \tag{3}$$

where $\phi^u(\cdot)$ is the unary potential function of variable
$h_i$ and $\phi^p(\cdot)$ is the pairwise potential function of
$(h_i, h_j)$. $\mathcal{E}$ is the set of neighboring latent
variables (defined for the pairs of temporally adjacent ROIs).

1) Unary potential $\alpha_i \cdot \phi^u(\mathbf{x}, y, h_i)$:
This singleton potential function $\phi^u(\cdot)$ models the
compatibility between the class label $y$ and the appearance of
region $R_i$ (note that $R_i = \mathbf{x}(h_i)$):

$$\alpha_i \cdot \phi^u(\mathbf{x}, y, h_i) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}_i} \alpha_i^a \cdot \delta_y(a) \cdot \delta_{h_i}(b) \cdot f(\mathbf{x}(h_i)) \tag{4}$$

where $f(\mathbf{x}(h_i))$ is the feature vector describing the
appearance of the region, as defined in Section 2.1. The
indicator function $\delta_y(a)$ is equal to one if $y = a$ and
zero otherwise. Similarly, $\delta_{h_i}(b)$ is equal to one if
$h_i = b$ and zero otherwise. The parameter $\alpha_i$ is simply
the concatenation of all $\alpha_i^a$.

2) Pairwise potential $\beta_{i,j} \cdot \phi^p(\mathbf{x}, y,
h_i, h_j)$: The potential function $\phi^p(\cdot)$ models the
compatibility between the class label $y$ and the temporal
transition of a pair of neighboring latent variables $(h_i,
h_j)$:

$$\beta_{i,j} \cdot \phi^p(\mathbf{x}, y, h_i, h_j) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}_i} \sum_{c \in \mathcal{H}_j} \beta_{i,j}^a \cdot \delta_y(a) \cdot \delta_{h_i}(b) \cdot \delta_{h_j}(c) \cdot f^p(\mathbf{x}, h_i, h_j) \tag{5}$$

where $f^p(\cdot)$ includes two components: an appearance
variance feature, computed from the difference between
$f(\mathbf{x}(h_i))$ and $f(\mathbf{x}(h_j))$, and a spatial
displacement feature, i.e., the Euclidean distance between the
spatial coordinates of $h_i$ and $h_j$. The parameter
$\beta_{i,j}$ is simply the concatenation of all $\beta_{i,j}^a$.
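
To make the two potentials concrete, the following sketch
evaluates the score of Eq. (3) for one fixed class and one fixed
assignment of the latent variables on a temporal chain. The array
layout, the function names and the use of the raw difference of
appearance features inside $f^p$ are assumptions made for
illustration, not details given in the text.

```python
import numpy as np

def pairwise_feature(feat_i, feat_j, pos_i, pos_j):
    """f^p: appearance difference concatenated with the Euclidean displacement."""
    displacement = np.linalg.norm(np.asarray(pos_i, dtype=float)
                                  - np.asarray(pos_j, dtype=float))
    return np.concatenate([feat_i - feat_j, [displacement]])

def chain_score(alpha, beta, feats, positions):
    """Score of one class for fixed ROI locations.

    alpha:     (m, d) unary weights for this class, one row per ROI
    beta:      (m - 1, d + 1) pairwise weights for temporally adjacent ROIs
    feats:     (m, d) appearance features f(x(h_i)) of the chosen regions
    positions: m spatial coordinates (x_i, y_i) of the chosen regions
    """
    m = len(feats)
    unary = sum(alpha[i] @ feats[i] for i in range(m))
    pairwise = sum(
        beta[i] @ pairwise_feature(feats[i], feats[i + 1],
                                   positions[i], positions[i + 1])
        for i in range(m - 1)
    )
    return float(unary + pairwise)
```

Keeping one `alpha` and `beta` per class mirrors the indicator
$\delta_y(a)$: only the weights of the class being scored
contribute.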



2.3. Learning 

Given a training set $D = \{(\mathbf{x}_1, y_1), \ldots,
(\mathbf{x}_N, y_N)\}$, the model parameter $\omega$ can be
learned by maximizing the conditional log-likelihood on the
training samples:

$$\omega^* = \arg\max_{\omega} \mathcal{L}(\omega) = \arg\max_{\omega} \sum_{i=1}^{N} \mathcal{L}_i(\omega) = \arg\max_{\omega} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i; \omega) \tag{6}$$

where $\mathcal{L}_i(\omega)$ denotes the conditional
log-likelihood of the $i$-th training example, with the
probability defined in Eq. (2), and $\mathcal{L}(\omega)$ denotes
the conditional log-likelihood of the whole training set. The
objective function $\mathcal{L}(\omega)$ is not concave, due to
the latent variables $\mathbf{h}$. We adopt the latent structural
SVM learning framework, which alternates between inferring the
latent variables $\mathbf{h}$ and optimizing the model parameter
$\omega$. The problem of inferring $\mathbf{h}$ can be solved
efficiently using a data-driven inference algorithm (Sec. 2.4),
and the parameter optimization is a standard structural SVM
training problem, solved by the cutting-plane algorithm. We
repeat the above two steps until convergence. We use the
one-vs-one binary classification strategy for the multi-class
classification problem.
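
The alternation between latent inference and parameter
optimization can be sketched on a toy binary problem. This is not
the latent structural SVM with cutting-plane training described
above: as a deliberately simplified stand-in, each example is a
small set of candidate feature vectors, the latent step picks the
highest-scoring candidate, and the parameter step runs plain
subgradient descent on a regularized hinge loss. All names and
hyperparameters are hypothetical.

```python
import numpy as np

def infer_latent(w, candidates):
    """Index of the candidate feature vector with the highest score w . f."""
    return int(np.argmax(candidates @ w))

def train_latent_svm(examples, labels, w_init,
                     rounds=3, steps=2000, lr=0.01, lam=0.01):
    """examples: list of (k_i, d) candidate arrays; labels: +1 or -1."""
    w = np.asarray(w_init, dtype=float).copy()
    y = np.asarray(labels, dtype=float)
    for _ in range(rounds):
        # Step 1: fix w and infer the latent choice of every example.
        feats = np.stack([c[infer_latent(w, c)] for c in examples])
        # Step 2: fix the latent choices and minimise the hinge loss
        # (subgradient descent here, in place of a cutting-plane solver).
        for _ in range(steps):
            active = y * (feats @ w) < 1.0
            data_grad = (y[active][:, None] * feats[active]).sum(axis=0) / len(y)
            w -= lr * (lam * w - data_grad)
    return w

def predict(w, candidates):
    """Binary decision from the best-scoring candidate."""
    return 1 if np.max(candidates @ w) > 0 else -1
```

The starting point matters: with `w_init` at zero every example
would select its first candidate, so a non-degenerate
initialization of the weights (or of the latent variables) is
assumed.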

Given a learned model, classification is achieved by first
finding the best hypothesis $\{h_i\}_{i=1}^{m}$ for the $m$ ROIs,
then picking the FLL class with the highest SVM score. The score
of an example $\mathbf{x}$ with a learned classifier is defined
as:

$$f(\mathbf{x}, y) = \max_{\mathbf{h} \in \mathcal{H}} \omega^T \Phi(\mathbf{x}, \mathbf{h}, y) \tag{7}$$

2.4. Data-driven inference 

The inference task is to find the optimal locations of the ROIs
(i.e., the latent variables $\mathbf{h}$). However, the search
space is very large if we consider all regions in all frames.
Thus, we propose a data-driven inference algorithm, which
efficiently combines spatial and temporal pruning techniques to
disregard less discriminative frames and regions. The optimal
locations $\{h_i\}_{i=1}^{m}$ of the most discriminative ROIs can
then be determined using dynamic programming.

1) Temporal pruning: In a CEUS video, the appearance of the
ultrasound frames often varies slowly and smoothly with the
hemodynamics, and the most discriminative frames are usually
those with the largest contrast changes compared with neighboring
frames. Thus, a small set of candidate frames, located at local
maxima of the contrast change, is automatically selected. In
particular, for each frame $I_t$ ($t = 1, \ldots, T$) in a video
$\mathbf{x}$, we compute the contrast feature $v_t$ from the
co-occurrence distribution $C_t$ defined over $I_t$ [9]. The
contrast vector is then $v = (v_1, v_2, \ldots, v_T)$. Let
$\Delta v$ be the gradient of $v$; the candidate frame set $B$ is
formed by the frames at the local maxima of $\Delta v$.
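
A minimal sketch of this temporal pruning step follows. Several
details are assumptions rather than the authors' settings: frames
are taken to be 8-bit grayscale arrays, the co-occurrence contrast
uses a single horizontal offset of one pixel and 16 gray levels,
the magnitude of `numpy.gradient` serves as the contrast change,
and only interior frames are considered as local maxima.

```python
import numpy as np

def frame_contrast(frame, levels=16):
    """GLCM contrast for the horizontal distance-1 offset.

    For a normalised co-occurrence matrix P, sum_{i,j} (i - j)^2 P(i, j)
    equals the mean squared difference of horizontally adjacent
    quantised pixels, so the matrix itself is not needed here.
    """
    q = (frame.astype(float) / 256.0 * levels).astype(int)
    q = np.clip(q, 0, levels - 1)
    diff = q[:, 1:] - q[:, :-1]
    return float(np.mean(diff ** 2))

def candidate_frames_from_contrast(v):
    """Interior indices where the contrast change has a local maximum."""
    change = np.abs(np.gradient(np.asarray(v, dtype=float)))
    return [t for t in range(1, len(change) - 1)
            if change[t] > change[t - 1] and change[t] >= change[t + 1]]

def candidate_frames(frames, levels=16):
    """Candidate frame set B for a sequence of at least two 2-D frames."""
    v = [frame_contrast(f, levels) for f in frames]
    return candidate_frames_from_contrast(v)
```

On a plateau of equal contrast change, the asymmetric comparison
keeps only the first frame of the plateau, which avoids selecting
runs of near-duplicate frames.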

Table 1. Sensitivities and mean accuracies on characterizing
benign and malignant FLLs. Sens means the sensitivity of the
specific class.

88.9%

Table 2. Sensitivities and mean accuracies in the different
experiment settings.

Setting      HCC Sens   HEM Sens   FNH Sens   Mean accuracy
manual       86.1%      85.7%      72.7%      83.8%
bruteforce   83.3%      80.1%      36.4%      75.0%
baseline     78.9%      22.0%      10.3%      49.9%

2) Spatial pruning: After temporal pruning, we also prune the
less important regions by considering two priors: a saliency
prior and a location prior. First, we believe that salient
regions (e.g., those having higher contrast or containing typical
structures) carry more discriminative information, and thus are
more likely to be ROI candidates. Second, we observe that FLLs
often appear in or close to the center of the images, probably
because a skilled ultrasound operator usually places the liver
area in the middle of the display. Based on these two
observations, we evaluate all the regions at different scales in
each candidate frame $I \in B$ (sliding window protocol), and
only select the regions with prior probability larger than a
threshold $\tau$ as ROI candidates. The prior probability of a
region $r$ being an ROI is

$$p(r) = S(r)\, \mathcal{G}(C_r \mid C_I, \sigma) \tag{8}$$

where $S(r)$ is the normalized mean saliency of the region $r$ in
the saliency map $S$, computed by the quaternion-based spectral
saliency method [10] on image $I$. $C_r$ and $C_I$ are the
centroids of region $r$ and image $I$, respectively, and
$\mathcal{G}(C_r \mid C_I, \sigma)$ is a Gaussian distribution.
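
The prior of Eq. (8) can be sketched as below. The saliency map is
taken as a given input scaled to [0, 1] (the quaternion-based
spectral saliency computation is not reproduced), and two further
assumptions are made so that a threshold such as 0.6 is
meaningful: centroids are normalized by the image size, and the
Gaussian is isotropic and unnormalized, so it equals 1 at the
image center. The function names are hypothetical.

```python
import numpy as np

def region_prior(saliency, box, sigma=0.5):
    """p(r) = S(r) * G(C_r | C_I, sigma) for a (row, col, height, width) box."""
    rows, cols = saliency.shape
    r, c, h, w = box
    mean_saliency = float(saliency[r:r + h, c:c + w].mean())
    # Centroids in normalised image coordinates (assumption).
    centre_region = np.array([(r + h / 2.0) / rows, (c + w / 2.0) / cols])
    centre_image = np.array([0.5, 0.5])
    sq_dist = float(np.sum((centre_region - centre_image) ** 2))
    gauss = float(np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return mean_saliency * gauss

def select_candidates(saliency, boxes, tau=0.6, sigma=0.5):
    """Keep the sliding-window boxes whose prior exceeds the threshold tau."""
    return [b for b in boxes if region_prior(saliency, b, sigma) > tau]
```

Under these assumptions a fully salient window in the image corner
is still rejected, because the location term alone drops below the
threshold.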

It is worth noting that the spatial pruning in the last two
vascular phases (portal and late) can be more aggressive. This is
because the contrast between FLLs and normal tissues is often
very low in these phases, and the locations of FLLs do not change
much after the arterial phase. Thus, in the last two phases, we
only search the regions in a spatial neighborhood around the
locations of the ROI candidates found in the arterial phase.
Finally, given the model parameters and the observations, the
latent variables $\mathbf{h} = \{h_1, h_2, \ldots, h_m\}$ form a
hidden Markov model, and can be solved exactly by the Viterbi
algorithm.
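
For reference, a generic max-sum Viterbi pass over a chain of $m$
ROIs looks as follows. The inputs are assumed to be precomputed
score tables for one class (unary scores per candidate location
and pairwise scores per pair of temporally adjacent candidates);
how these tables are filled is outside the sketch, and the
function name is illustrative.

```python
import numpy as np

def viterbi(unary, pairwise):
    """Best candidate per ROI on a chain, by dynamic programming.

    unary:    list of m 1-D arrays, unary[i][a] = score of candidate a for ROI i
    pairwise: list of m - 1 matrices, pairwise[i][a, b] = score of candidate a
              for ROI i followed by candidate b for ROI i + 1
    Returns the best index per ROI and the total score.
    """
    best = np.asarray(unary[0], dtype=float)
    back = []
    for i in range(1, len(unary)):
        total = (best[:, None]
                 + np.asarray(pairwise[i - 1], dtype=float)
                 + np.asarray(unary[i], dtype=float)[None, :])
        back.append(np.argmax(total, axis=0))   # best predecessor per candidate
        best = np.max(total, axis=0)
    path = [int(np.argmax(best))]
    for pointers in reversed(back):             # trace the best path backwards
        path.append(int(pointers[path[-1]]))
    path.reverse()
    return path, float(np.max(best))
```

The cost is linear in the number of ROIs and quadratic in the
number of candidates per ROI, which is why pruning the candidate
sets beforehand pays off.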

3. RESULTS 

We test our method on the SYSU-CEUS dataset collected from the
First Affiliated Hospital, Sun Yat-sen University, which is
publicly available¹. The equipment used was an Aplio SSA-770A
(Toshiba Medical System). The dataset consists of three types of
FLLs: 186 HCC, 109 HEM and 58 FNH instances (i.e., 186 malignant
and 167 benign instances). All these instances, with resolution
768 × 576, were taken from different patients, with large
variations in the appearance and enhancement patterns (e.g.,
size, contrast, shape and location) of the FLLs. We adopt 5-fold
cross validation and use the sensitivity for each class and the
mean accuracy as the evaluation criteria, as in previous work. In
our implementation, we extract four statistics (i.e., Contrast,
Correlation, Energy, Homogeneity) of the GLCM [9] with four
orientations ($\theta = 0°, 45°, 90°, 135°$) and one distance
“1” to represent the texture feature $f^t$ (Sec. 2.1). Three
region scales (i.e., 64 × 64, 128 × 128, 200 × 200) and a step
length of 20 are used for the sliding windows, and $\tau = 0.6$
and $\sigma = 0.5$ are used for spatial pruning. The experiments
are carried out on a PC with a Core i7 3.4 GHz CPU, and the
average processing time for a 4-min CEUS video is about 100
seconds.

¹ https://github.com/lemondan/Focal-liver-lesions-dataset-in-CEUS

Table 3. Comparisons of region representation methods by
applying different feature descriptors.
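
The GLCM texture descriptor mentioned in the implementation
details (four statistics, four orientations, distance 1) could be
computed along the following lines. This is a sketch with
assumptions: the input is an integer image already quantized to
`levels` gray levels, the matrix is made symmetric and normalized,
Energy is the sum of squared entries and Homogeneity uses
$1 / (1 + |i - j|)$, and the correlation of a constant region is
set to 1 by convention. The paper does not state which definitions
were used.

```python
import numpy as np

# (row, column) pixel offsets for the four orientations at distance 1.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(q, offset, levels):
    """Symmetric, normalised co-occurrence matrix of an integer image q."""
    dr, dc = offset
    rows, cols = q.shape
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    src = q[r0:r1, c0:c1]
    dst = q[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    counts = np.zeros((levels, levels))
    np.add.at(counts, (src.ravel(), dst.ravel()), 1)
    counts += counts.T
    return counts / counts.sum()

def glcm_stats(p):
    """Contrast, Correlation, Energy and Homogeneity of a normalised GLCM."""
    i, j = np.indices(p.shape)
    contrast = np.sum((i - j) ** 2 * p)
    energy = np.sum(p ** 2)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    if sd_i * sd_j == 0:
        correlation = 1.0   # constant region: convention chosen for this sketch
    else:
        correlation = np.sum((i - mu_i) * (j - mu_j) * p) / (sd_i * sd_j)
    return np.array([contrast, correlation, energy, homogeneity])

def texture_feature(q, levels):
    """16-dimensional descriptor: 4 statistics for each of 4 orientations."""
    return np.concatenate([glcm_stats(glcm(q, OFFSETS[a], levels))
                           for a in (0, 45, 90, 135)])
```

For a region with vertical stripes the horizontal orientation
gives high contrast and negative correlation, while the vertical
orientation gives zero contrast, which is the kind of directional
information the four orientations are meant to capture.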


We first report the sensitivities and mean accuracies of our
method in differentiating benign and malignant FLLs in Table 1.
The average accuracy (89.7%) is comparable, if not superior, to
the results reported in previous studies on smaller datasets. The
second experiment, in Table 2, shows the effectiveness of our
data-driven inference algorithm by altering the procedure used to
determine the ROIs. Our data-driven inference algorithm (“DDI”)
is compared with 1) “manual”: the ROI of each instance in the
arterial phase is manually selected and inference is only
performed in the portal and late phases; 2) “bruteforce”: the
liver region is labeled and the optimal ROIs are searched in the
entire liver region, without pruning; 3) “baseline”: the ROIs are
randomly selected in the images of the three phases. The results
demonstrate that our fully automatic inference algorithm achieves
performance comparable to the “manual” method, and performs
better than “bruteforce” and “baseline”. Note that the
performance of our algorithm on FNH is worse because the amount
of FNH training data is relatively small.
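
The two evaluation criteria can be computed as in the sketch
below, which treats the mean accuracy as the overall fraction of
correctly classified instances. That reading is an assumption, as
is reading the three “manual” sensitivities in Table 2 as HCC, HEM
and FNH in that order; under both assumptions the reported mean
accuracy of that row is reproduced by weighting the sensitivities
with the class sizes (186, 109, 58).

```python
import numpy as np

def sensitivities_and_accuracy(y_true, y_pred, classes):
    """Per-class sensitivity (recall) and overall accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sens = {c: float(np.mean(y_pred[y_true == c] == c)) for c in classes}
    accuracy = float(np.mean(y_pred == y_true))
    return sens, accuracy

# Overall accuracy is the class-size-weighted mean of the sensitivities.
manual_sens = [86.1, 85.7, 72.7]   # read as HCC, HEM, FNH (assumption)
class_sizes = [186, 109, 58]
weighted_mean = float(np.average(manual_sens, weights=class_sizes))
# weighted_mean is about 83.77, close to the 83.8% listed for "manual"
```

The same weighting gives roughly 74.6 and 50.1 for the other two
rows, near but not identical to the listed 75.0% and 49.9%; small
gaps of this kind would be expected if the reported figures are
averaged over folds.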

Finally, in Table 3 we compare the region representation of our
framework with other state-of-the-art methods: Multiple-ROI,
$\text{ROI}_{posterior}$ [13] and $\text{ROI}_{out}$ [13]. Each
region representation is tested with three popular low-level
features: GLCM, Laws' texture and Local Phase, similar to [13].
We manually select ROIs in the three phases as required in
previous works (note that here we do not consider the performance
of the inference algorithm), and use a linear SVM as the
classifier. The results show that our region representation
obtains superior performance in general.

4. CONCLUSIONS 

In this work we propose a fully automatic computational framework
for characterizing different types of FLLs in CEUS, which
efficiently combines the diverse information of spatial and
temporal enhancement patterns. A weakly supervised learning
algorithm is used, which alternates between inferring the latent
variables (i.e., the locations of ROIs) and optimizing the model
parameters. A data-driven inference algorithm is proposed to
efficiently determine the optimal locations of ROIs. The results
show promising classification accuracies and the potential for
development into real-time clinical applications. In the future,
a more interactive system will be developed to enable
radiologists to revise the diagnosis according to the detailed
outputs of our algorithm (e.g., the locations of ROIs and the
reference frames).